FORMULA 1 all-time drivers analysis

First approach and explanation

GP of Brazil 1974

First of all, it is worth explaining the topic and defining the question we want to answer:

Formula 1 is an extreme sport in which mechanical technology and driver skill come together. Even for humans it is difficult to separate out the machine/car component and determine who the best drivers of all time were. That is why it is interesting to ask whether different algorithms can classify the drivers.

Can unsupervised learning classify the ability of drivers without taking into account the mechanical component?

We first include some useful setup lines in our code and load all the required libraries.

Libraries

rm(list=ls())


#Used for visualisation
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ✔ purrr   0.3.4      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(GGally) 
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(corrplot)
## corrplot 0.92 loaded
library(dplyr)
#library(lubridate)
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## 
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## 
## The following object is masked from 'package:datasets':
## 
##     sleep
#Used for clustering
library(cluster)
library(mclust)
## Package 'mclust' version 5.4.10
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## 
## The following object is masked from 'package:VIM':
## 
##     diabetes
## 
## The following object is masked from 'package:purrr':
## 
##     map
library(kernlab)
## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha

1. IMPORTING AND CLEANING THE DATA

Our data come from the same source but are split across several files. We will need to extract the relevant information about each driver. First we import the files containing historical Formula 1 information.

drivers = read.csv("drivers.csv")
results= read.csv("results.csv")
qualifying = read.csv("qualifying.csv")
pitstops = read.csv("pit_stops.csv")
races = read.csv("races.csv")
drivers_standings = read.csv("driver_standings.csv")
lap_times = read.csv("lap_times.csv")
sprint_results = read.csv("sprint_results.csv")

races2 = races[,(1:2)] # That's the year of each race by race ID
results= merge(results,races2, by = "raceId")

CREATING THE DATA FRAME

Now that we have all the information, we create the data frame with the following variables:

-driverId : number identifying the driver

-driverName : name identifying the driver

-nationality : nationality under which the driver races

-total.races : number of races the driver has started

-total_points : number of points scored (only the top 10 finishers score*)

-mean_result_race : mean finishing position over the driver's whole career

-mean_result_qualy : mean qualifying position

-wins : total wins

-podiums : total number of podiums

-poles : total number of pole positions (first place on the starting grid on Sunday)

-retirements : total number of retirements (not finishing the race)

-constructors : number of teams the driver has raced for

-mean_time_pitstop : average time spent in the pits

-seasons : number of seasons the driver has raced (there are only 20 seats in F1, so experience and skill are related)

-total_laps : total number of laps the driver has completed

-number_of_fastest_lap : number of fastest race laps set by the driver**

-championsships : number of world championships the driver has won (at most one per season; there have been only 72 seasons)

*The points system has changed over the history of F1; we do not account for this.

**Each race rewards the driver who set the fastest lap during that race.

It is important to clarify that variables such as the number of races, the mean result and the points do not depend exactly on each other, because of how the points system works. In addition, we are not taking into account sprint races, a new format introduced two years ago.

dataf1 = data.frame(matrix(nrow=854, ncol=17))

colnames(dataf1) = c("driverId", "driverName","nationality", "total.races",
                     "total_points","mean_result_race",
                     "mean_result_qualy", "wins","podiums","poles","retirements",
                     "constructors","mean_time_pitstop","seasons","total_laps",
                     "number_of_fastest_lap","championsships")
dataf1[,1] = drivers[,1] # The id
dataf1[,2] = drivers[,2]#the name
drivers1 = drivers[,2]#We keep the names 
countries = drivers[,8]#nationality
dataf1[,3] = drivers[,8]
dataf1[,17] = 0 # We will compute the championships after.



for(j in 1:854){
  i = dataf1$driverId[j]
  dataf1[j,4] = nrow(results[results$driverId== i,]) # number of races
  dataf1[j,5] = sum(results$points[results$driverId== i]) #Number of points
  dataf1[j,6] = mean(results$positionOrder[results$driverId == i]) # Result race on average
  dataf1[j,7] = mean(qualifying$position[qualifying$driverId == i])# Result qualifying on average
  dataf1[j,8] = sum(results$positionOrder[results$driverId == i]==1)# Number of wins
  dataf1[j,9] = sum(results$positionOrder[results$driverId == i]<4) # Number of podiums
  dataf1[j,10] = sum(qualifying$position[qualifying$driverId == i]==1) # Number of poles
  dataf1[j,11] = sum(results$positionText[results$driverId == i]=="R") # Number of retirements
  dataf1[j,12] = n_distinct(results$constructorId[results$driverId== i])# Number of constructors
  dataf1[j,13] = mean(pitstops$milliseconds[pitstops$driverId== i]) # Time pit stops
  dataf1[j,14] = n_distinct(results$year[results$driverId== i]) # Number of seasons
  dataf1[j,15] = sum(results$laps[results$driverId== i])  #Number of laps
  dataf1[j,16] = sum(results$rank[results$driverId == i]==1) # Number of fastest laps
}

head(dataf1)
##   driverId driverName nationality total.races total_points mean_result_race
## 1        1   hamilton     British         301       4308.5         4.707641
## 2        2   heidfeld      German         184        259.0        10.722826
## 3        3    rosberg      German         206       1594.5         8.252427
## 4        4     alonso     Spanish         349       2021.0         8.409742
## 5        5 kovalainen     Finnish         112        105.0        13.285714
## 6        6   nakajima    Japanese          36          9.0        13.083333
##   mean_result_qualy wins podiums poles retirements constructors
## 1          3.521595  103     188   106          24            2
## 2         11.100000    0      13     1          45            6
## 3          6.834951   23      57    30          29            2
## 4          8.093373   32      98    23          62            5
## 5         13.883929    1       4     1          19            5
## 6         14.027778    0       0     0           6            1
##   mean_time_pitstop seasons total_laps number_of_fastest_lap championsships
## 1          81302.43      16      17216                    61              0
## 2          22933.32      12       9699                     2              0
## 3          49760.69      11      11159                    20              0
## 4          77975.89      19      18741                    22              0
## 5          24608.67       7       5975                     2              0
## 6               NaN       3       1972                     0              0

It is important to know whether a driver is a world champion, but this information was not available in the data sets, so we add it ourselves.

# We classify the drivers by the number of championships they won.
world_champions1 = c("farina","hawthorn","phil_hill","surtees","hulme","hunt",
                     "andretti","jones", "scheckter","rosberg","keke_rosberg",
                     "villeneuve","mansell","button","raikkonen","damon_hill")
world_champions2 = c("ascari","hill","clark","fittipaldi","alonso",
                     "hakkinen","max_verstappen")
world_champions3 = c("brabham","stewart","lauda","piquet","senna")
world_champions4 = c("prost","vettel")
world_champions5 = c("fangio")
world_champions7 = c("hamilton","michael_schumacher")
for (i in 1:nrow(dataf1)){
  name = dataf1$driverName[i]
  if(name %in% world_champions1){
    dataf1$championsships[i] = 1
  }else if(name %in% world_champions2){
    dataf1$championsships[i] = 2
    
  }else if(name %in% world_champions3){
    dataf1$championsships[i] = 3
    
  }else if(name %in% world_champions4){
    dataf1$championsships[i] = 4
    
  }else if(name %in% world_champions5){
    dataf1$championsships[i] = 5
    
  }else if(name %in% world_champions7){
    dataf1$championsships[i] = 7
  
  }else{
    dataf1$championsships[i] = 0
  }
}

#We verify that the sum of championships is 72.
sum(dataf1$championsships)
## [1] 72

We have verified that the number of championships equals the number of seasons (72).

Removing missing values and completing the data frame

With this first graph we can locate the variables with missing values: in this case, the mean qualifying result and the pit stop time.

#Missing values
barplot(colMeans(is.na(dataf1)), las=2)

Looking at the original data sets, we can see that the qualifying and pit stop records only start well after 1990, which is why values are missing for earlier drivers.

To solve the problem of the missing qualifying values, we note that the qualifying result determines the starting order (ignoring penalties). So it seems reasonable to fill the missing qualifying values with the mean grid/starting position.

for(i in 1:854){
  if(is.na(dataf1$mean_result_qualy[i])){
    id = dataf1$driverId[i] # match on driverId, not on the row index
    if(mean(results$grid[results$driverId== id])!=0){
      dataf1$mean_result_qualy[i] = mean(results$grid[results$driverId== id])
    }else{
      dataf1$mean_result_qualy[i] = 10 # no grid data: assign a mid-field value
    }
  }
}
barplot(colMeans(is.na(dataf1)), las=2)

Now for the pit stops: there are many NAs, 785 in total. Each car has to make one or more pit stops per race, but the time spent in the pits depends partly on the driver's ability to arrive at the right position and speed, and even more on the mechanics. That is why we can drop this variable.

sum(is.na(dataf1[,13]))
## [1] 785
dataf1 = dataf1[,-13]
barplot(colMeans(is.na(dataf1)), las=2)

We now have a data set without any missing values.

Checking the outliers of the data set

QI <- quantile(dataf1$total_points, 0.25)
QS <- quantile(dataf1$total_points, 0.75)
IQR = QS-QI

sum(dataf1$total_points < QI - 1.5*IQR | dataf1$total_points > QS + 1.5*IQR)
## [1] 144
outliers = dataf1[which(dataf1$total_points < QI - 1.5*IQR | dataf1$total_points > QS + 1.5*IQR),2]
outliers[1:10]
##  [1] "hamilton"   "heidfeld"   "rosberg"    "alonso"     "kovalainen"
##  [6] "raikkonen"  "kubica"     "glock"      "sato"       "massa"

We have many outliers, but this is logical: the best drivers are precisely the ones with outstanding statistics. Looking at the outliers, we find some of the most well-known drivers. We cannot remove them, as they are relevant to our research. We accept the risk of keeping the outliers, bearing in mind that our results will be less precise as a consequence.
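For reference, the alternative we are deliberately not taking would be to cap the extreme values at the Tukey fences instead of keeping them as-is. A minimal sketch of such IQR winsorizing (a hypothetical helper, not part of this analysis), using the same 1.5 fence multiplier as above:

```r
# Cap values at Q1 - k*IQR and Q3 + k*IQR instead of removing them.
winsorize_iqr <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  iqr <- q[2] - q[1]
  pmin(pmax(x, q[1] - k * iqr), q[2] + k * iqr)
}
winsorize_iqr(c(1, 2, 3, 4, 100))  # the 100 is capped at the upper fence, 7
```

We reject this option here because the capped values would be exactly the champions we want the algorithms to distinguish.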

Having finished cleaning and preprocessing the data, we can start with a first analysis.

2. VISUALIZATION TOOLS

a.) We want to know whether the qualifying result, even though it is not part of the race itself, is relevant to the race result. We therefore create the following graph comparing race and qualifying results for each driver; we see that the best drivers, the ones with podiums, are usually as good in qualifying as in races.

ggplot(dataf1[which(dataf1$mean_result_qualy != 0),]) + aes(x= mean_result_race, y = mean_result_qualy, colour = podiums>0) +geom_point()+ scale_colour_manual(values=c('grey','red')) +labs(title="Qualifying result as a function of race result", x = "RACES",y= "QUALIFYING", colour= "podium")

We can see that the drivers who qualify best usually get the best race results. That seems coherent: the better your starting position, the better your result tends to be, and a driver who is fast in qualifying is usually also fast in the race.

b.) We want to know whether podiums and wins are difficult to obtain. If so, those variables are important for assessing driver ability. This also tells us how many drivers have had little influence in F1 over the years.

plot1 = dataf1[which(dataf1$podiums<1 & dataf1$total_points>0),]
ggplot(plot1) + aes(x=total.races, y= total_points, label = driverName, colour = mean_result_race) + geom_point()+
  labs(title="Points as a function of races", x="Total Races", y="Total Points",colour= "result race") + theme(legend.position="bottom") + geom_text(size=3.2, hjust=1.5, vjust=-0.4, check_overlap = TRUE)

Nico Hulkenberg in his Renault

We repeat the plot without Hulkenberg to see the rest more clearly.

plot1 = dataf1[which(dataf1$podiums<1 & dataf1$total_points>0 & dataf1$driverName != "hulkenberg"),]
ggplot(plot1) + aes(x=total.races, y= total_points, label = driverName, colour = mean_result_race) + geom_point()+
  labs(title="Points as a function of races", x="Total Races", y="Total Points",colour= "result race") + theme(legend.position="bottom") + geom_text(size=3.2, hjust=1.5, vjust=-0.4, check_overlap = TRUE)

We can see that getting a podium is very difficult. Even a driver like Hulkenberg, with over 500 career points and a fairly good mean result, has never achieved one. It is clearly an extreme case, but other drivers such as Di Resta or Sutil, with more than 100 races, also have no podium.

c.) Now we check the same for wins.

plot2 = dataf1[which(dataf1$wins<1 & dataf1$total_points>0),]
ggplot(plot2) + aes(x=total.races, y= total_points, label = driverName, colour = podiums) + geom_point()+
  labs(title="Points as a function of races", x="Total Races", y="Total Points", colour = "podiums") + theme(legend.position="bottom") + geom_text(size=3.2, hjust=1, vjust=1, check_overlap = TRUE)+  scale_color_gradient(low="grey", high="black")

We can see that achieving a win is very difficult: many drivers with good results and an enormous number of races have never won.

d.) Earlier we said that the points system has changed across the years. The following graph shows that the finishing position does not determine the number of points proportionally. This is also explained by other factors:

-If the race is not completed, only half points are awarded

-Extra points for fastest laps

-Points are not distributed proportionally to the result.
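To make the last point concrete, here is a sketch of the current (2010-present) top-10 scale; earlier eras used quite different tables, which is exactly why points cannot be compared proportionally across careers. The helper function is ours, purely for illustration:

```r
# Points per finishing position under the current base system (before
# the fastest-lap bonus); positions 11 and below score zero.
points_for_position <- function(pos) {
  pts <- c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1)
  ifelse(pos >= 1 & pos <= 10, pts[pos], 0)
}
points_for_position(c(1, 2, 9, 10, 15))  # 25 18 2 1 0
```

Winning pays 25/18 ≈ 1.4 times second place, while 9th pays twice 10th: the mapping from position to points is far from linear.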

plot3 = dataf1[which(dataf1$wins>0 & dataf1$total_points>0),]
ggplot(plot3) + aes(x=total_points, y= mean_result_race, label = driverName, colour = total.races) + geom_point()+
  labs(title="Mean race result as a function of points", x="Total Points", y="Mean Race Result") + theme(legend.position="bottom") + geom_text(size=3.2, hjust=-0.6, vjust=0, check_overlap = TRUE)

We can see that drivers like Farina or Fangio, who drove during the first seasons, have a very good mean result but not many points or races. On the other hand, Alonso or Massa, with worse mean results, have more points and races. For our analysis we must therefore remember that points, mean result and number of races cannot be merged into a single measure; we have to treat them separately.

We now focus on the world champions only.

e.) We want to see how champions differ across the years: how difficult it was to become world champion then versus now.

plot4 = dataf1[which(dataf1$championsships>0),]
ggplot(plot4) + aes(x=podiums, y= total_points, label = driverName, colour = wins) + geom_point()+
  labs(title="Points as a function of podiums", x="Total Podiums", y="Total Points") + theme(legend.position="bottom") + geom_text(size=3.2, hjust=1, vjust=0, check_overlap = TRUE)  + scale_color_gradient(low="yellow", high="darkred")

The conclusion is that the older drivers have fewer races and points than the recent ones, which might seem surprising. The graph does not say which drivers are the old ones, but a quick search gives each driver's racing period. The pattern is explained by the ever-growing number of races in the calendar. In that sense we can say it was easier to be a world champion before, which makes it interesting to see whether the algorithmic classification of drivers will be close to the human one.

Study by countries

Finally, we compare the number of drivers each country has produced and how many of them are race winners; this will also be useful later when comparing the best countries in F1.

ggplot(dataf1) + aes(x = reorder(nationality, -championsships), fill = wins>0) + geom_bar()+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+scale_fill_manual(values=c('grey','gold')) +labs(title = "Drivers per country", x = "Countries", y = "Number of drivers", fill= "Winners" )

We can see that the countries with the most drivers of all time are the USA, the UK, Italy and France, which is interesting for the rest of the analysis. But countries such as Finland, Brazil or Germany have a better percentage of winners per driver.

PCA Study

We now perform a PCA to rank the drivers. We first create a new numerical data frame f1. Checking the scale of the variables, we see that the lap counts are much larger than the rest, which is why we have to scale the data.

f1= dataf1[,-c(1,2,3)] # Taking out non-numerical data
boxplot(f1, las=2, col="darkblue")

boxplot(scale(f1), las=2, col="darkblue") # Scaling

X = scale(f1)

The data are now scaled.

We compute the correlation matrix of the variables to see to what extent they are correlated.

R = cor(f1)   # correlation matrix
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(R, method="color",  
         type="upper", order="hclust", 
         addCoef.col = NULL, # Add coefficient of correlation
         tl.col="black", tl.srt=45, #Text label color and rotation
         # Combine with significance
         # hide correlation coefficient on the principal diagonal
         diag=FALSE 
         )

We can see that some variables are correlated, such as wins and podiums, which is expected. Other variables, such as the number of constructors a driver has raced for, show little correlation. It is interesting to see correlations such as total points with fastest laps. This will be useful for understanding how the principal components are built.

We run the principal component analysis after verifying that there are no missing values.

is.na(f1) %>% sum #No missing value
## [1] 0
pca = prcomp(f1,scale = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.7021 1.5312 0.99151 0.87794 0.81142 0.70131 0.37421
## Proportion of Variance 0.5616 0.1804 0.07562 0.05929 0.05065 0.03783 0.01077
## Cumulative Proportion  0.5616 0.7420 0.81762 0.87691 0.92756 0.96539 0.97616
##                            PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.33802 0.28700 0.24534 0.16431 0.14722 0.06652
## Proportion of Variance 0.00879 0.00634 0.00463 0.00208 0.00167 0.00034
## Cumulative Proportion  0.98495 0.99129 0.99592 0.99799 0.99966 1.00000

We can take a look at how the importance of each dimension is distributed.

fviz_screeplot(pca, addlabels = TRUE)

This graph is very useful: it shows that the first, second and third components together explain 81.8% of the variance, which is enough for us.
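The percentages on the scree plot come straight from the squared component standard deviations. A self-contained sketch on simulated data (standing in for f1):

```r
# Proportion of variance per component = sdev^2 / sum(sdev^2);
# the running total gives the "Cumulative Proportion" row of summary(pca).
set.seed(1)
toy <- matrix(rnorm(100 * 5), ncol = 5)
p   <- prcomp(toy, scale. = TRUE)
cumsum(p$sdev^2 / sum(p$sdev^2))  # ends at 1
```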

First component

We can see that this component gives weight to wins and podiums, but also to experience. We can expect the drivers at the top of it to be the most successful and most experienced.

sum(pca$rotation[,1]^2)
## [1] 1
fviz_contrib(pca, choice = "var", axes = 1)

# The worst
drivers1[order(pca$x[,1])][(length(drivers1)-10):length(drivers1)]
##  [1] "apicella"    "heyer"       "vonlanthen"  "duncan"      "boffa"      
##  [6] "loof"        "fitzau"      "mcrae"       "bisch"       "adolff"     
## [11] "jerry_unser"

The bottom drivers are unknown; looking them up online, some of them were reserve drivers who raced only once or twice.

# The best
drivers1[order(pca$x[,1])][1:30]
##  [1] "hamilton"           "michael_schumacher" "vettel"            
##  [4] "raikkonen"          "alonso"             "prost"             
##  [7] "rosberg"            "max_verstappen"     "button"            
## [10] "barrichello"        "bottas"             "massa"             
## [13] "senna"              "piquet"             "webber"            
## [16] "lauda"              "mansell"            "ricciardo"         
## [19] "coulthard"          "hill"               "patrese"           
## [22] "hakkinen"           "berger"             "perez"             
## [25] "stewart"            "damon_hill"         "trulli"            
## [28] "fangio"             "fisichella"         "villeneuve"

The top drivers are well known, and some of them appear among the best in history in human-made lists. This is the component that will later be compared with the human rankings.

Second component

The second component gives importance to the number of seasons and of different teams. Such things might not matter much in other sports, but in F1, where only 20 drivers can compete, keeping a seat is very difficult. We expect very experienced drivers such as Alonso, Hamilton, Schumacher or Vettel at the top.

barplot(pca$rotation[,2], las=2, col="darkblue")

fviz_contrib(pca, choice = "var", axes = 2)

# The worst
drivers1[order(pca$x[,2])][(length(drivers1)-10):length(drivers1)]
##  [1] "alesi"       "brundle"     "cheever"     "trintignant" "trulli"     
##  [6] "alboreto"    "jarier"      "bonnier"     "amon"        "patrese"    
## [11] "cesaris"

Once again, these drivers are not especially well known.

# The best
drivers1[order(pca$x[,2])][1:30]
##  [1] "hamilton"           "vettel"             "michael_schumacher"
##  [4] "max_verstappen"     "rosberg"            "bottas"            
##  [7] "raikkonen"          "alonso"             "fangio"            
## [10] "leclerc"            "brabham"            "jerry_unser"       
## [13] "boffa"              "bisch"              "adolff"            
## [16] "bertaggia"          "loof"               "mcrae"             
## [19] "fitzau"             "joachim_winkelhock" "duncan"            
## [22] "mcguire"            "george"             "bauer"             
## [25] "gary_brabham"       "dusio"              "kozarowitzky"      
## [28] "weidler"            "bordeu"             "monteverdi"

The names we mentioned are in the top list, but there are also some drivers who are not so well known to the general public. Those are drivers with a lot of experience who did not achieve such great results.

Third component

The third component gives importance to the result of races and qualifying.

fviz_contrib(pca, choice = "var", axes = 3)

# The worst
drivers1[order(pca$x[,3])][(length(drivers1)-10):length(drivers1)]
##  [1] "banks"           "force"           "duncan_hamilton" "quester"        
##  [5] "veith"           "parsons"         "fohr"            "wallard"        
##  [9] "krause"          "ayulo"           "rose"
# The best
drivers1[order(pca$x[,3])][1:10]
##  [1] "raphanel"           "duke"               "bertaggia"         
##  [4] "andersson"          "kinnunen"           "sutton"            
##  [7] "poele"              "joachim_winkelhock" "villota"           
## [10] "mcguire"

Here the drivers are not the usual ones. This can be explained by the fact that some stand-in drivers have very good mean results because they entered very few races. Before doing the PCA we might have thought that the mean race result would be decisive in all cases, but once we realise that a driver with one good race gets an extremely good value in this variable, the outcome is understandable.

It is interesting to see Villota near the top of this classification. This is a case of a driver who entered very few races. I wanted to highlight this example because she was one of the few women drivers, and she was Spanish; even after passing away she remains one of the main references of Spanish motorsport.

Maria de Villota

INFLUENCE OF DRIVERS IN THE PCA

We now look at how much each driver contributes to the components, focusing on the most influential drivers.

head((pca$x[,1]^2)/(pca$sdev[1]^2))/dim(f1)[1] # this is between 0 and 1
## [1] 1.946014e-01 2.903637e-03 2.164077e-02 4.668380e-02 6.270401e-04
## [6] 2.157173e-05
names_z1 = drivers1[order(get_pca_ind(pca)$contrib[,1],decreasing=T)]
fviz_contrib(pca, choice = "ind", axes = 1, top=20)+scale_x_discrete(labels=names_z1)

We see that some drivers, such as Hamilton, explain a lot of the classification. That is the consequence of keeping the outliers, and we are conscious of the risk of bias we accepted at the start.

We study how the variables influence the first and second principal components.

fviz_pca_var(pca, col.var = "contrib")

This is coherent with what we explained previously. The first component takes into account experience (laps, races) and results (wins, podiums); the mean result, which can be unrepresentative for very inexperienced drivers, carries little weight. The second component weighs experience more than achievements. We do not plot the third component because it only involves mean_result_qualy and mean_result_race.

ANALYSIS OF THE PCA RESULTS

We are going to check some interesting relations between the PCA components and other variables.

Number of podiums as a function of the PCA

We compare the number of podiums against PC1 and PC2 only, since podiums show no relation with PC3.

data.frame(z1=-pca$x[,1],z2=-pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=drivers1,color=f1$podiums)) + geom_point(size=0) +
  labs(title="PCA", x="PC1", y="PC2", color = "number of podiums") +
  theme_bw() + scale_color_gradient(low="lightblue", high="darkblue")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)

With the help of this graph we can conclude that there is a clear relation between the number of podiums and PC1; we cannot say the same for PC2.

Number of retirements as a function of the PCA

We compare the number of retirements against PC1 and PC2 only, since retirements show no relation with PC3.

data.frame(z1=-pca$x[,1],z2=-pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=drivers1,color=f1$retirements)) + geom_point(size=0) +
  labs(title="PCA", x="PC1", y="PC2", color = "number of retirements") +
  theme_bw() + scale_color_gradient(low="orange", high="darkred")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE) 

As expected, we cannot find any relation between retirements and the PCA components. Retirements usually happen because of random incidents or mechanical failures.

Mean race result as a function of the PCA

We want to see whether the mean result relates to any of the components.

First we compare PC1 and PC3.

data.frame(z1=-pca$x[,1],z2=-pca$x[,3]) %>% 
  ggplot(aes(z1,z2,label=drivers1,color=f1$mean_result_race)) + geom_point(size=0) +
  labs(title="PCA", x="PC1", y="PC3", color = "resul race") +
  theme_bw() + scale_color_gradient(low="lightgreen", high="darkgreen")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE) 

We can see a strong relationship between PC3 and the mean result, but not with PC1.

We now compare with PC2.

data.frame(z1=-pca$x[,2],z2=pca$x[,3]) %>% 
  ggplot(aes(z1,z2,label=drivers1,color=f1$mean_result_race)) + geom_point(size=0) +
  labs(title="PCA", x="PC2", y="PC3", color = "result race") +
  theme_bw() + scale_color_gradient(low="grey", high="black")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE) 

Here again we see the relationship between PC3 and the race result. For PC2 it is not so clear, but there also seems to be a slight relation. The relation was obvious for PC3, but it is always worth verifying.

Classification of countries and PCA conclusion

We want to know which is the best country in Formula One.

data.frame(z1=-pca$x[,1],country = countries) %>% group_by(country) %>% summarise(mean=mean(z1)) %>% arrange(desc(mean))
## # A tibble: 42 × 2
##    country        mean
##    <chr>         <dbl>
##  1 Finnish       4.50 
##  2 Polish        2.12 
##  3 Mexican       1.21 
##  4 Brazilian     1.07 
##  5 German        0.911
##  6 Australian    0.900
##  7 Spanish       0.900
##  8 New Zealander 0.630
##  9 Austrian      0.607
## 10 Swedish       0.547
## # … with 32 more rows

We can see that the top countries here are not the ones mentioned before (USA, UK, Italy and France), because this ranking averages driver quality rather than counting drivers. The United States has many good drivers but also many bad ones. Finland has not had many drivers, but the Finnish drivers have usually been very good.

The SUN

drivers1[order(pca$x[,1])][1:15]
##  [1] "hamilton"           "michael_schumacher" "vettel"            
##  [4] "raikkonen"          "alonso"             "prost"             
##  [7] "rosberg"            "max_verstappen"     "button"            
## [10] "barrichello"        "bottas"             "massa"             
## [13] "senna"              "piquet"             "webber"

In conclusion, the human classification (The Sun) and our first component show some similarity, and it is quite accurate: Hamilton is the best driver of all time in both. We can also see that the evolution of the rules affects the result: older drivers rank lower, with Stewart at 25 and Clark outside the top 30.

FACTOR ANALYSIS

We start the factor analysis using the regression scoring technique:

factor<- factanal(X, factors = 4, rotation="none", scores="regression", nstart=200)
factor
## 
## Call:
## factanal(x = X, factors = 4, scores = "regression", rotation = "none",     nstart = 200)
## 
## Uniquenesses:
##           total.races          total_points      mean_result_race 
##                 0.005                 0.018                 0.867 
##     mean_result_qualy                  wins               podiums 
##                 0.894                 0.005                 0.050 
##                 poles           retirements          constructors 
##                 0.123                 0.105                 0.005 
##               seasons            total_laps number_of_fastest_lap 
##                 0.121                 0.017                 0.061 
##        championsships 
##                 0.188 
## 
## Loadings:
##                       Factor1 Factor2 Factor3 Factor4
## total.races            0.930  -0.113          -0.339 
## total_points           0.803   0.437   0.383         
## mean_result_race      -0.359                         
## mean_result_qualy     -0.322                         
## wins                   0.797   0.512  -0.108   0.294 
## podiums                0.890   0.390                 
## poles                  0.634   0.542   0.401   0.146 
## retirements            0.783  -0.347  -0.280  -0.290 
## constructors           0.619  -0.729           0.280 
## seasons                0.868  -0.349                 
## total_laps             0.935                  -0.330 
## number_of_fastest_lap  0.646   0.477   0.542         
## championsships         0.647   0.483  -0.181   0.356 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      7.029   2.150   0.732   0.630
## Proportion Var   0.541   0.165   0.056   0.048
## Cumulative Var   0.541   0.706   0.762   0.811
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 1046.69 on 32 degrees of freedom.
## The p-value is 2.47e-199

We can see that the factor analysis explains 81% of the variance. We will try other techniques to see whether we can improve on this result.
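We can recover these percentages by hand (a sketch using the `factor` object above): the proportion of variance of each factor is its sum of squared loadings divided by the number of variables.

```r
# SS loadings per factor, divided by the 13 variables,
# reproduces the "Proportion Var" row; their sum is the cumulative ~0.81
ss <- colSums(factor$loadings^2)
ss / ncol(X)
sum(ss) / ncol(X)
```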

We now run the factor analysis with varimax rotation and Bartlett scores:

factor2 <- factanal(X, factors = 4, rotation="varimax", scores="Bartlett", lower = 0.01)
factor2
## 
## Call:
## factanal(x = X, factors = 4, scores = "Bartlett", rotation = "varimax",     lower = 0.01)
## 
## Uniquenesses:
##           total.races          total_points      mean_result_race 
##                 0.010                 0.019                 0.862 
##     mean_result_qualy                  wins               podiums 
##                 0.894                 0.010                 0.048 
##                 poles           retirements          constructors 
##                 0.120                 0.117                 0.010 
##               seasons            total_laps number_of_fastest_lap 
##                 0.115                 0.014                 0.057 
##        championsships 
##                 0.185 
## 
## Loadings:
##                       Factor1 Factor2 Factor3 Factor4
## total.races            0.915   0.333   0.182   0.104 
## total_points           0.370   0.849   0.349         
## mean_result_race      -0.316  -0.109  -0.123  -0.105 
## mean_result_qualy     -0.233  -0.107  -0.164  -0.114 
## wins                   0.298   0.495   0.809         
## podiums                0.498   0.558   0.626         
## poles                  0.110   0.836   0.411         
## retirements            0.909           0.146   0.190 
## constructors           0.529                   0.841 
## seasons                0.790   0.184   0.199   0.433 
## total_laps             0.880   0.406   0.207         
## number_of_fastest_lap  0.182   0.925   0.232         
## championsships         0.182   0.355   0.808         
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      4.047   3.293   2.222   0.978
## Proportion Var   0.311   0.253   0.171   0.075
## Cumulative Var   0.311   0.565   0.736   0.811
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 1091.35 on 32 degrees of freedom.
## The p-value is 9.23e-209

With these techniques we again explain just over 80% of the variance. Although the overall fit is the same, the varimax rotation makes the loadings easier to interpret, so we continue with this model.

We now analyse the factors:

par(mfrow=c(2,2))
barplot(factor2$loadings[,1], names=F, las=2, col="darkblue", ylim = c(-1, 1))
barplot(factor2$loadings[,2], names=F, las=2, col="darkblue", ylim = c(-1, 1))
barplot(factor2$loadings[,3], las=2, col="darkblue", ylim = c(-1, 1))
barplot(factor2$loadings[,4], las=2, col="darkblue", ylim = c(-1, 1))

These plots suggest that:

-The first factor captures experience, through variables like total.races, seasons or total_laps.

-The second factor captures in-race achievements, through variables like number_of_fastest_lap, poles or total_points.

-The third factor captures career achievements, through variables like championships, wins or podiums.

-The last factor captures the variety of car experience, through variables like retirements, constructors and seasons.

factor.df = data.frame(drivers=drivers1, factor2$scores) %>% gather("factor", "score", -drivers)

# Note: par(mfrow=...) only affects base graphics, not ggplot2,
# so each factor is plotted separately
ggplot() +aes(x = drivers1 , y = factor2$scores[,1]) + geom_point() +labs(x = "drivers", y = "Factor1")+theme(axis.text.x=element_blank())

ggplot() +aes(x = drivers1 , y = factor2$scores[,2]) + geom_point() +labs(x = "drivers", y = "Factor2")+theme(axis.text.x=element_blank())

ggplot() +aes(x = drivers1 , y = factor2$scores[,3]) + geom_point() +labs(x = "drivers", y = "Factor3")+theme(axis.text.x=element_blank())

ggplot() +aes(x = drivers1 , y = factor2$scores[,4]) + geom_point() +labs(x = "drivers", y = "Factor4")+theme(axis.text.x=element_blank())

From these results we want to know whether the historical period a driver raced in determines some of the factors. Note that the most recent drivers (2016-2022) appear last in the plots, and the earlier ones (1950-2016) first. The first factor, experience, does not seem to change across eras. The second factor seems to give more weight to the oldest drivers, perhaps because they were more dominant relative to their rivals. The last factor does not seem to be biased by the period either. In conclusion, the data do not seem strongly affected by the different periods and rule sets of F1. We also clearly see here the consequences of keeping the outliers in the dataset.

CLUSTERING

We can compute the clustering in different ways, but first we should estimate the number of clusters. There are several techniques for this:

-WSS

-SILHOUETTE

-GAP STATISTIC

We will use the WSS and gap statistic methods:

fviz_nbclust(X, kmeans, method = 'wss')

fviz_nbclust(X, kmeans, method = 'gap_stat', k.max = 20)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

We can see that the optimal number of clusters is between 2 and 4.
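The silhouette criterion listed above can serve as a cross-check (a sketch on the same scaled matrix `X`):

```r
# Silhouette method: the k with the highest average silhouette
# width is a candidate number of clusters
fviz_nbclust(X, kmeans, method = "silhouette", k.max = 10)
```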

We are going to try different clustering techniques to see which one fits best:

-KMEANS

-PAM CLUSTERING

-KERNEL CLUSTERING

-CLUSTERING EM

CLUSTERING BY KMEANS

We compute the clusters with four centers and see how many drivers fall into each group:

set.seed(345)
fit = kmeans(X, centers=4, nstart=100)
groups = fit$cluster
centers=fit$centers
barplot(table(groups), border="red", col="white")

table(groups)
## groups
##   1   2   3   4 
##  66 183   5 600

We can see that there is a group with only 5 drivers: the super champions known by everybody. Their statistics are so outstanding that they are the outliers we could not remove. The group of 66 contains good drivers, but not at the level of the super champions, while the largest group contains the less successful drivers.
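We can name the members of the 5-driver cluster to check that they really are the all-time greats (a sketch using `groups` and the driver labels `drivers1`; cluster 3 is the 5-member one in the table above):

```r
# List the drivers assigned to the smallest cluster
drivers1[groups == 3]
```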

fviz_cluster(fit, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
  theme_minimal() + geom_text(label=drivers1,hjust=0, vjust=0,size=2,check_overlap=T)+scale_fill_brewer(palette="Paired")

We can see that the drivers are divided into four groups. The smallest one is composed of the outstanding drivers. Another group is composed of good drivers that reached a solid level in the category. A third group is composed of less successful drivers; it is the largest group, which is to be expected. The last group is composed of reserve drivers.

d <- dist(X, method="euclidean")  
sil = silhouette(groups, d)
plot(sil, col=1:4, main="", border=NA)

summary(sil)
## Silhouette of 854 units in 4 clusters from silhouette.default(x = groups, dist = d) :
##  Cluster sizes and average silhouette widths:
##         66        183          5        600 
## 0.08048741 0.22609912 0.19524388 0.37131702 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2353  0.2309  0.3480  0.3167  0.4314  0.5316

The silhouette plot confirms our earlier findings, and shows that the number of outliers affects the quality of the groups.

PAM CLUSTERING

We do the PAM clustering:

fit.pam <- eclust(X, "pam", stand=F, k=3, graph=F)

fviz_cluster(fit.pam, data = X, geom = c("point"), pointsize=1)+
  theme_minimal()+geom_text(label=drivers1 ,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Paired")

centers2=fit.pam$medoids
par(mfrow=c(3,1))
barplot(centers2[1,], names=F, las=2, col="white", border = "red")
barplot(centers2[2,], names=F, las=2, col="white", border = "red")
barplot(centers2[3,], las=2, col="white", border = "red")

This type of clustering seems less affected by the outliers and divides the drivers into three groups:

-1 the most experienced drivers

-2 drivers with less experience but good results

-3 drivers with less experience and weaker results, though similar to group 2
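A nice property of PAM is that its centers are actual observations, so the representative driver of each cluster can be named directly (a sketch; `id.med` holds the medoid row indices of a pam fit):

```r
# One medoid driver per cluster
drivers1[fit.pam$id.med]
```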

Kernel k-means

We do the Kernel clustering:

library(kernlab) # kkmeans() comes from kernlab
fit.ker <- kkmeans(as.matrix(X), centers=3, kernel="rbfdot") 
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
centers(fit.ker)
##            [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
## [1,] -0.4057355 -0.2081425 -0.2565222  0.8311304 -0.1922957 -0.2557508
## [2,]  0.8028810  0.3951626 -0.5005969 -0.3272513  0.3648028  0.4877370
## [3,] -0.4436365 -0.2089776  0.8418627 -0.5577857 -0.1927618 -0.2592138
##            [,7]       [,8]       [,9]      [,10]      [,11]      [,12]
## [1,] -0.1084031 -0.4032351 -0.5208294 -0.5066433 -0.3697649 -0.1220206
## [2,]  0.2054147  0.8150222  0.9318843  0.9803623  0.7711338  0.2292341
## [3,] -0.1084031 -0.4599261 -0.4595548 -0.5292594 -0.4482022 -0.1198114
##           [,13]
## [1,] -0.1596338
## [2,]  0.3024925
## [3,] -0.1596338
size(fit.ker)
## [1] 294 295 265
withinss(fit.ker)
## [1] 1174.703 9180.815 1257.722
object.ker = list(data = X, cluster = fit.ker@.Data)
fviz_cluster(object.ker, geom = c("point"), ellipse=T,pointsize=1)+
  theme_minimal()+geom_text(label=drivers1,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Paired")

With this clustering we obtain three groups of roughly similar size, which fits the data better. One cluster again contains the most experienced drivers with the best results, another is made up of mid-level drivers, and the last contains the drivers with less experience and poorer results.

This is the best cluster technique for the moment.
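To support this comparison numerically (a sketch reusing objects defined above: `groups` from k-means, `fit.pam` and `fit.ker`), we can compare the average silhouette width of the three partitions on the same Euclidean distance matrix; higher is better:

```r
d <- dist(X, method = "euclidean")
# Average silhouette width of each partition
mean(silhouette(groups, d)[, "sil_width"])                    # k-means
mean(silhouette(fit.pam$clustering, d)[, "sil_width"])        # PAM
mean(silhouette(as.integer(fit.ker@.Data), d)[, "sil_width"]) # kernel k-means
```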

Clustering EM

We compute the clustering by Expectation-Maximization:

res.Mclust <- Mclust(scale(X)) # This clustering works with probabilities
summary(res.Mclust)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEV (ellipsoidal, equal volume and shape) model with 8 components: 
## 
##  log-likelihood   n  df      BIC      ICL
##        7426.428 854 748 9803.907 9550.536
## 
## Clustering table:
##   1   2   3   4   5   6   7   8 
##  24  24  24  19  39  80  91 553
head(res.Mclust$z)
##               [,1]         [,2]         [,3]         [,4]        [,5]
## [1,]  1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000000
## [2,] 1.651085e-126 0.000000e+00 1.000000e+00 0.000000e+00 0.000000000
## [3,]  1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000000
## [4,]  1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000000
## [5,]  0.000000e+00 0.000000e+00 1.000000e+00 0.000000e+00 0.000000000
## [6,]  0.000000e+00 2.002976e-20 8.098126e-07 5.559033e-16 0.009517102
##            [,6]        [,7]      [,8]
## [1,] 0.00000000 0.000000000 0.0000000
## [2,] 0.00000000 0.000000000 0.0000000
## [3,] 0.00000000 0.000000000 0.0000000
## [4,] 0.00000000 0.000000000 0.0000000
## [5,] 0.00000000 0.000000000 0.0000000
## [6,] 0.05091119 0.003388732 0.9361822
head(res.Mclust$classification)
## [1] 1 3 1 1 3 8

We can see that the clustering creates 8 groups, and the last one is much bigger than the rest; this is not going to be the best model.
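To compare EM more directly with the 3- and 4-group solutions above, we can fix the number of components instead of letting BIC choose it (a sketch; `G` constrains the number of mixture components Mclust fits):

```r
# Gaussian mixture with exactly 3 components
res.Mclust3 <- Mclust(scale(X), G = 3)
table(res.Mclust3$classification)
```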

fviz_mclust(object = res.Mclust, what = "BIC", palette = "jco") +
  scale_x_discrete(limits = factor(1:10))
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.

We plot it:

fviz_mclust(object = res.Mclust, what = "classification", geom = "point",
            palette = "jco")

This plot is not as useful as the kernel k-means clustering.

Dendrogram (not clear enough)

In order to visualize the data and the groups we can create a dendrogram. Given the number of drivers, however, we can expect it to be hard to read.

d = dist(scale(X), method = "euclidean")
hc <- hclust(d, method = "ward.D2")

hc$labels <- drivers1

fviz_dend(x = hc, k=5, palette = "jco", rect = TRUE, rect_fill = TRUE, rect_border = "jco") + labs(title="Drivers of the F1 History Dendrogram") 
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

In fact this dendrogram is not clear at all because of the amount of data. In order to visualize it better, we will create a dendrogram with the countries.

CONCLUSION

It could be interesting to run the same analysis at the country level.

We create the country data frame the same way we did for the drivers:

country = data.frame(matrix(nrow= length(unique(countries)), ncol = 13))
colnames(country) = c("nationality", "total.races",
                      "total_points","mean_result_race",
                      "mean_result_qualy", "wins","podiums","poles","retirements",
                      "constructors","seasons","total_laps","championsships")

country[,1] = unique(countries)
for(j in 1:length(unique(countries))){
  i = country[j,1]
  country[j,2] = sum(dataf1$total.races[which(dataf1$nationality==i)])
  country[j,3] = sum(dataf1$total_points[which(dataf1$nationality==i)])
  country[j,4] = mean(dataf1$mean_result_race[which(dataf1$nationality==i)])
  country[j,5] = mean(dataf1$mean_result_qualy[which(dataf1$nationality==i)])
  country[j,6] = sum(dataf1$wins[which(dataf1$nationality==i)])
  country[j,7] = sum(dataf1$podiums[which(dataf1$nationality==i)])
  country[j,8] = sum(dataf1$poles[which(dataf1$nationality==i)])
  country[j,9] = sum(dataf1$retirements[which(dataf1$nationality==i)])
  country[j,10] = sum(dataf1$constructors[which(dataf1$nationality==i)])
  country[j,11] = sum(dataf1$seasons[which(dataf1$nationality==i)])
  country[j,12] = sum(dataf1$total_laps[which(dataf1$nationality==i)])
  country[j,13] = sum(dataf1$championsships[which(dataf1$nationality==i)])
}
countryf1 = country[,-1]
Y = scale(countryf1)

head(countryf1)
##   total.races total_points mean_result_race mean_result_qualy wins podiums
## 1        4391     10325.64         16.06373         13.571402  308     731
## 2        2366      7925.50         18.06777         15.041690  179     415
## 3         804      2783.50         16.85340         12.639424   33     112
## 4        1138      4375.50         15.86533          9.496173   57     245
## 5         626       252.00         16.80469         16.839724    0       3
## 6        2986      3422.33         14.99599         14.958519   81     309
##   poles retirements constructors seasons total_laps championsships
## 1   136        1531          430     594     197424             20
## 2   131         610          126     213     111795             12
## 3    24         209           31      65      38995              2
## 4    49         287           34      72      56733              4
## 5     0         259           39      56      25541              0
## 6     1        1111          194     297     130834              4
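As a side note, the aggregation loop above can be written more compactly with dplyr (a sketch assuming `dataf1` contains the columns used in the loop); `across()` applies `sum` or `mean` to each group of columns:

```r
country_tidy <- dataf1 %>%
  group_by(nationality) %>%
  summarise(across(c(total.races, total_points, wins, podiums, poles,
                     retirements, constructors, seasons, total_laps,
                     championsships), sum),
            across(c(mean_result_race, mean_result_qualy), mean))
```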

We can create the heat maps of the drivers.

This heat map has too many drivers, but we can see a clearer one with the countries.

heatmap(scale(X), scale = "none",
        distfun = function(x){dist(x, method = "euclidean")},
        hclustfun = function(x){hclust(x, method = "ward.D2")},
        cexRow = 0.7,labRow = drivers1)

This is the heatmap of the countries

heatmap(scale(Y), scale = "none",
        distfun = function(x){dist(x, method = "euclidean")},
        hclustfun = function(x){hclust(x, method = "ward.D2")},
        cexRow = 0.7,labRow =unique(countries))

This heat map is very interesting. Countries like the UK, USA, Italy and France are grouped together, which reflects reality very well.

d = dist(scale(Y), method = "euclidean")
hc <- hclust(d, method = "ward.D2")

hc$labels <- unique(countries)
fviz_dend(x = hc, k = 5, color_labels_by_k = TRUE, cex = 0.8, type = "phylogenic", repel = TRUE) + labs(title="Countries of the F1 History Dendrogram") 

This dendrogram is much clearer, and as before, the same countries are grouped together. Britain actually stands apart, perhaps because of the outlier Hamilton. Moreover, Brazil, Finland and Germany are also grouped as strong countries, which is logical given the level of their drivers.

As a final conclusion, we can say that the unsupervised learning techniques approximate reality quite well. We must stress that it was not obvious that we would obtain these results, given the changes in the sport over time. So, can unsupervised learning capture driver ability beyond the raw race results? The answer is yes!

Carlos Sainz in his Ferrari

SOURCES

The data comes from: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020. The images are free to use. Some graphs use the same techniques seen in our practical classes.

Jaime Salafranca